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Abstract 

Bayesian Additive Regression Trees (BART) is a statistical sum 
of trees model. It can be considered a Bayesian version of machine 
learning tree ensemble methods where the individual trees are the base 
learners. However for datasets where the number of variables p is large 
(e.g. p > 5,000) the algorithm can become prohibitively expensive, 
computationally. Another method which is popular for high dimen¬ 
sional data is random forests, a machine learning algorithm which 
grows trees using a greedy search for the best split points. However, 
as it is not a statistical model, it cannot produce probabilistic esti¬ 
mates or predictions. 

We propose an alternative algorithm for BART called BART- 
BMA, which uses Bayesian Model Averaging and a greedy search al¬ 
gorithm to produce a model which is much more efficient than BART 
for datasets with large p. BART-BMA incorporates elements of both 
BART and random forests to offer a model-based algorithm which can 
deal with high-dimensional data. 
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We have found that BART-BMA can be run in a reasonable time 
on a standard laptop for the “small n large p” scenario which is com¬ 
mon in many areas of bioinformatics. We showcase this method using 
simulated data and data from two real proteomic experiments; one to 
distinguish between patients with cardiovascular disease and controls 
and another to classify aggressive from non-aggressive prostate cancer. 
We compare our results to their main competitors. Open source code 
written in R and Repp to run BART-BMA can be found at: 
https://github.com/BelindaHernandez/BART-BMA.git 


1 Introduction 

Advances in technology and data collection have meant that many fields are 
now collecting and analysing bigger datasets than ever before (Lynch, 2008). 
This has brought the analysis of high-dimensional data to the forefront of 
statistical analysis (Biihlmann and Van De Geer 2011 ; Fujikoshi et ah 2011; 
Zhao et ah 2012). In many areas of research, especially biomedical applica¬ 
tions, it is common to have datasets which collect very detailed data on a 
relatively small subset of observations, resulting in what is known as a “small 
n large p" problem, where the number of variables p is often much larger than 
the number of observations n. This precludes the use of many standard sta¬ 
tistical techniques and means that only a restricted set of statistical models 
can be used (Hernandez et ah, 2014). 

Random forests (RF), first proposed by Breiman (2001), is a popular 
method for dealing with high-dimensional data, mainly because of its com¬ 
putational speed and high accuracy. It is a non-parametric method and so 
does not make any major distributional assumptions about the data. RF au¬ 
tomatically allows for non-linear interaction effects, a desirable property in 
many high-dimensional datasets (Nicodemus et ah 2010; Archer and Kimes 
2008). The standard output of the RF method not only reports the accuracy 
of the algorithm, but also gives a variable importance measure for each vari¬ 
able which informs the user as to which variables were the most predictive. 
However, as RF is a machine learning algorithm and does not use a statis¬ 
tical model it cannot provide probability-based uncertainty intervals as in a 
Bayesian setting. 

Bayesian methods have proven popular in many areas of research, in 
part because they are robust to over-htting in the presence of small sample 
sizes and can handle missing or incomplete data. Because a full probabil¬ 
ity model is used, external information from, for example, previous experi¬ 
ments or academic literature can be incorporated into the model, which is 
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an advantage over non-Bayesian statistical and machine learning techniqnes 
(Wilkinson, 2007). Bayesian methods also offer the flexibility to incorporate 
known experimental and biological variability into a prior probability distri- 
bntion (Harris et ah, 2009). A key beneht of using model-based approaches 
is that they give a principled estimate of uncertainty, which can be useful 
in decision-making. Machine learning algorithms usually provide point esti¬ 
mates only and so decisions are made ignoring the uncertainty surrounding 
these estimates. 

Bayesian Additive Regression Trees (BART) (Chipman et ah, 2010) is a 
Bayesian tree ensemble method similar in idea to gradient boosting (Fried¬ 
man, 2001a), which combines the advantages of Bayesian models with those 
of ensemble methods such as RF. The advent of a parallelised R software 
package called bartMachine (Kapelner and Bleich, 2014a) has made BART 
a feasible option for analysis of a wide range of datasets. As BART is a model- 
based approach, it yields credible intervals for predicted values, in contrast 
to machine learning algorithms such as RF. However, as acknowledged in 
their manual (Kapelner and Bleich, 2014b), the algorithm for BART is very 
memory intensive for datasets with large numbers of variables. When tested 
on datasets with p > 5, 000 bartMachine required a minimum of 10GB of 
RAM using its default settings. 

In this article we propose an alternative algorithm for BART which we 
refer to as Bayesian Additive Regression Trees using Bayesian Model Aver¬ 
aging (BART-BMA). BART-BMA modihes the original BART method in a 
number of ways to make the algorithm more efficient for high-dimensional 
data. BART-BMA can be seen as a bridge between RF and BART in that 
it is model-based yet will run on high-dimensional data. One of the main 
reasons the original proposal of BART struggles in high dimensions is that 
it uses Markov Chain Monte Carlo (MCMC) to sample from the posterior 
distribution of the tree space. Rather than using MCMC and saving every 
iteration of the MCMC chain for each tree to memory, BART-BMA uses an 
efficient variant of Bayesian model averaging called Occam’s window (Madi- 
gan and Raftery, 1994). This method discards models with low posterior 
probability and focuses hnal predictions on the subset of models with the 
highest posterior probabilities. In order to improve model selection speed, 
BART-BMA uses a greedy search algorithm to hnd predictive split points, 
so only high quality splits are proposed when growing tree models. Thus 
BART-BMA is computationally feasible for high-dimensional datasets, does 
not require any specialised hardware or software and brings with it the ad¬ 
vantages of a model-based approach. 

In this paper we showcase BART-BMA using a simulated example as well 
as two real applications to proteomic experiments. The article is organised 
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as follows. Section 2 reviews existing tree based variable selection models 
such as RF (Breiman, 2001) and BART (Chipman et ah, 2010). Section 3 
describes our proposed model and explains the differences between it and 
BART. Section 4 compares BART-BMA to BART and their main competitor 
RF for a number of simulated datasets and applies these methods to two 
proteomics datasets. We conclude in Section 5. 

2 Tree-Based Variable Selection 

Tree-based models have long been used for prediction and classihcation, going 
back to 1963 for the analysis of survey data (Morgan and Sonquist 1963; 
Morgan 2005). Tree-based modelling came to the fore with the seminal work 
of Quinlan (1979) and Quinlan (1986), and particularly the Classihcation 
and Regression Tree (CART) method of Breiman et al. (1984). 

Decision trees consist of internal nodes where splitting rules are imposed 
on the data in the form of Xp < c, where Xp refers to the pth variable in 
dataset X and c is a threshold value within the range of values of variable Xp. 
Observations which satisfy the split rule are sent to the left hand daughter 
node and those which do not are sent to the right hand daughter node. 
Observations are iteratively split into left and right hand daughter nodes as 
they pass through each internal node in turn until a terminal node is reached. 

One of the main reasons for the popularity of tree models over standard 
statistical models such as linear regression is that decision trees automati¬ 
cally search for and include non-linear interaction effects. It was later noted 
however that individual decision trees tend to over-fit and be sensitive to 
the training data they were built on. To counteract this, many ensemble 
methods were proposed where multiple diverse models are aggregated or av¬ 
eraged over to give a more stable and generalisable solution (Breiman 1996a; 
Breiman 1996b; Friedman 2001b). 

2.1 Random Forests (RF) 

Random forests (Breiman, 2001) is one of the most popular tree based ensem¬ 
ble algorithms and has been used in many helds (Ham et al. 2005; Svetnik 
et al. 2003; Daz-Uriarte and Alvarez de Andrs 2006). RFs use an average 
of multiple CART decision trees in their method. Each decision tree in the 
RF algorithm is based on a bootstrap sample of observations and trees are 
grown by splitting on a random sample of variables in each internal node. In 
this way RFs avoid over-htting and give a cross-validated estimate of model 
performance by reporting accuracy on the out of bag samples that were not 
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used to build the tree model. 

Rather than using a statistical model, RF performs an exhaustive search 
for split points on various subsets of data. Each tree in the RF is grown to 
maximal depth so that each observation is predicted precisely in its terminal 
node. Therefore individual trees tend to be quite large and complex. The 
main reasons for the popularity of RFs are that they are generally very accu¬ 
rate, are computationally fast and will work for large datasets. Furthermore 
the RF algorithm provides a variable importance score for each variable used 
in the construction of the RF. The two main variable importance scores used 
are the decrease in Gini impurity which is generally used for classihcation 
problems and the mean decrease in accuracy which is generally used for re¬ 
gression problems. Therefore the RF can be used for model prediction and 
explanation unlike other black box algorithms such as support vector ma¬ 
chines and neural networks which do not provide variable importance scores 
by default. 

Another desirable aspect of RF is that it allows access to the trees that are 
averaged over in its model. RF however grows its trees to maximum depth 
and so tends to have very deep trees where many of the splits do not add to 
the predictive accuracy, thus making individual trees hard to interpret. 

2.2 Bayesian Additive Regression Trees (BART) 

BART is a Bayesian tree ensemble method where the response variable Y is 
estimated by a sum of Bayesian CART trees (Chipman et ah, 2010). Given 
an n X p matrix of explanatory variables X, let Xk = [xki ,..., Xkj^ be the kth. 
row (i.e. the /cth observation) of X. The basic BART model is 

m 

Yk = 'Yhsi.Xk] Tj, Mj) + Ek, ( 1 ) 

i=i 

where g{xk) is a Bayesian CART decision tree model as described in Chipman 
et al. (1998), Tj refers to decision tree j = 1.. .m where m is the total number 

of trees in the model, Tj has terminal node parameters Mj, and e ~ -^(0, cr^) 
where is the residual variance. The original BART model is htted via a 
back-htting Gibbs sampler which draws from the joint posterior distribution 
of all the trees and terminal node parameters and the standard deviation, 
given the data (Chipman et ah, 2010). 

Each tree Tj is iteratively £t based on the residuals of the previous trees 
at the current iteration of the Gibbs sampler until a predetermined number 
of iterations is reached. Assuming prior independence of the trees Tj, the 
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terminal node parameters G Mj (where i = 1.. .bj indexes the terminal 
nodes of tree j) and a, the posterior distribution is 


p{T,M,a\X,Y)(xp{Y\X,T,M,a) 


. j i 


Pi(^), 


( 2 ) 


where p{Y\X,T, M,a) is the overall likelihood of the sum of trees model. 
In (2), each tree Tj has terminal node parameters Mj, p{fiij\Tj) is the prior 
distribution of the terminal node parameters pij G Mj given the tree struc¬ 
ture Tj and p{a) is the prior distribution of a. p(Tj) is the prior on tree Tj. 
BART places the same regularisation prior on the tree size and shape as that 
of Chipman et ah (1998) to deter any one tree from having undue influence 
over the sum of trees model. The probability that a given terminal node is 
further split into two children nodes is Psplit = Q:{1 + d)~l^, where d is the 
depth of internal node i and a and 13 are parameters which determine the 
size and shape for tree respectively. 

The prior distribution of the terminal node parameters pij is Pij\Tj ~ 
7V(0,(To), where ao = and e is a user-specihed parameter with recom¬ 
mended values between 1 and 3. To set the prior distribution of the pij, 
Chipman et al. (2010) noted that E(y|X) is modelled as a sum of m pij 
parameters. BART then uses a data-informed prior for the pij that implies 
that E(R|X) lies within the range of values of Y with high probability. Here 
the response variable Y is centered at zero and scaled to have minimum and 
maximum values at —0.5 and 0.5 respectively before analysis. For this reason 
the prior mean of pij is set to 0. The prior terminal node variance Uo is then 
set such that ey/mao = 0.5, and e is chosen such that the prior implies that 
E(R|X) lies between —0.5 and 0.5 with high probability. For example e = 2 
sets a 95% prior probability that the K(Yscaied\X) lies between -0.5 and 0.5. 

BART assumes that the model precision r has prior distribution r ~ 
Ga(|, ^), where r = with degrees of freedom u and scale A respectively. 
This gives rise to the following full conditional distribution of the partial 
residuals Rj\... = Y — Yl^j Mi) for each tree Tj-. 


p{Rj\X,Tj,T) oc fniT-h 


2=1 


0.5 


e\/m 


T 


-1 


T 


exp 


Rlj + exp 




(3) 


. <.=1 


2 (riiT + 


0.5 


* eyfm 


where n* is the number of observations in terminal node i of tree j and Rij 
is the mean of the tree response variable Rj for terminal node i. The termi- 
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nal node parameters Mj have been marginalised over in equation (3). This 
reduces the computational cost of BART as the calculation of the likelihood 
for each tree Tj does not have to take account of the changing dimensionality 
of the terminal node parameters as trees are grown and pruned. 

New trees in the BART algorithm are proposed in the Gibbs sampler 
by selecting from one of four proposal moves: GROW (node birth), PRUNE 
(node death), GHANGE (changing split rules) and SWAP (swapping internal 
nodes). The algorithm for BART is shown in Algorithm 1. 


2.3 BART for Classification 

Por binary classihcation BART follows the latent variable probit approach of 
Albert and Ghib (1993). Latent variables are introduced so that 


Yk 


1 if Zfc > 0 
0 otherwise. 


The sum of trees prior is then placed on the so that Z^ N(ZT=i9ixk;T^:M^),l 
It follows that that W is Bernoulli with 


P{Yk 


m 


l\xk) = <h y^g{xk]Tj,M. 


0=1 


(4) 


where <h is the standard normal cumulative distribution function, used here 
as the link function. Note that there is no residual variance parameter in 
the classihcation version of the model. 


The full posterior distribution is then 


p(T, M, Z|X, U) oc p{Y\Z)p{Z\X, T, M) 


\{\{p{p^^T,)p{T,) 


(5) 


where the top level of the likelihood (i.e. the hrst term on the right hand 
side) is a deterministic function of the latent variables. The conditional prior 
distributions of the terminal node parameters Pij\Tj are set in the same way 
as that described in Section 2.2 except that an = —^ instead of an = 

This is in order to assign high prior probability to the interval (<h[—3], *h[3]) 
which corresponds to the 0.1% and 99.9% quantiles of the normal c.d.f.. 
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Algorithm 1: BART Algorithm 


Input: n X p matrix X with response variable Y 
Output: Credible interval for R, after burn-in updates for a 
Initialize residuals i?oo = Y 
Set tree Tq to a stump 
for 7 ■(— 1 £o niter do 
for j ^ 1 to m do 

1. Generate a proposal tree T* by choosing from 
one of the following proposal moves: 

• GROW 

• PRUNE 

• GRANGE 

• SWAP 

2. Set = T* with probability 

q-IT'i' t*] = i\ 

Else set = T] 

3. Update terminal node parameters Mj by 
drawing from p{Mj\Tj, Rj, a) 

4. Update a parameter by drawing from 

p(aiM,T,R) 

5. Update predicted values for 

6. Update = U - M^) 


Setp; = Er=i^?(^;^i>^i) 


return Gredible interval for V, after burn-in updates for a 





The latent variables introduce an extra step in the Gibbs algorithm 
to £t the model. The full conditional distributions oi Zk \... are: 


I min 



N (j2j9{xk]Tj,Mj),l^ ,0 


if Yk = 1, 

if n = 0. 


( 6 ) 


The other updates (e.g. for Tj \... and Mj \...) are the same as in the stan¬ 
dard BART algorithm of Section 2.2. The difference is that now the response 
is the latent variable Z^, from which residuals Rj can be calculated for up¬ 
dating of individual trees. 


2.4 Issues with current tree-based methods 

RF and BART are both popular and useful ensemble methods which not only 
provide a measure of accuracy but also a variable importance score which can 
be used for variable selection. 

A major advantage of RF is that it has a fast running time and can be ap¬ 
plied to high-dimensional datasets on a standard laptop computer. However, 
a disadvantage is that the method does not provide a principled assessment 
of uncertainty about the prediction. Only point predictions of Y are given 
with no estimate as to the variability of these predicted values. 

BART on the other hand is a fully specified Bayesian model and so can 
provide estimates of model and predictive uncertainty. However, it becomes 
prohibitively memory intensive when used on high-dimensional data. Two 
main bottlenecks were noted in the BART model as described in Section 2 
where p is large. The first is that using a uniform prior to choose predic¬ 
tive split rules in each internal node of each tree produces chains with high 
rejection rates for large p. Thus the MCMC algorithm becomes inefficient, 
especially if the number of truly predictive variables is small. Another issue 
for high-dimensional data is that the BART algorithm is very memory hun¬ 
gry when the length of the chains and the size of each sum of trees are large 
(a by-product of p being large). This is because each tree model for each iter¬ 
ation of each MCMC chain must be saved to memory if needed for prediction 
of an external dataset. All of these issues are addressed by BART-BMA. 


3 BART-BMA 

BART-BMA is a Bayesian CART ensemble model, which like BART uses a 
sum of trees model. BART-BMA can be seen as a bridge between RF and 
BART in that, like RF it can be applied to high-dimensional datasets on a 
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standard laptop, but it is also a model-based sum of trees method like BART 
and therefore can provide estimates of predictive uncertainty. 

BART-BMA departs from the original BART model in a number of ways. 
First BART-BMA performs a greedy search for predictive split rules and only 
grows sum of trees models based on this set of most predictive splits. Then, 
rather than using MCMC, BART-BMA uses an efficient implementation of 
Bayesian Model Averaging which averages over multiple models. Each model 
in BART-BMA consists of a sum of trees in a similar vein to BART. The 
algorithm for BART-BMA is shown in Algorithm 2. 

3.1 Finding Good Split Rules 

BART-BMA uses a greedy search method for Ending predictive split points. 
The alternative of uniformly proposing split rules for tree internal nodes, used 
by BART, becomes very slow as the number of variables increases. It will 
be particularly inefficient when the number of variables associated with the 
response is low, as in many bioinformatics studies. With high-dimensional 
data we argue that a greedier search should be adopted in order to focus 
the algorithm towards predictive splitting rules. BART-BMA proposes the 
use of a change point algorithm called Pruned Exact Linear Time (PELT) 
as proposed in Killick et al. (2012) as well as a grid search method, both of 
which we now discuss. 


3.1.1 PELT 

Searching for predictive split points for a single variable in a tree is equivalent 
to searching for change points in a univariate stochastic process. Because of 
this, BART-BMA uses a change point detection algorithm called Pruned Ex¬ 
act Linear Time (PELT) to search greedily for predictive split points (Killick 
et ah, 2012). PELT was originally proposed to detect change points in time 
series data by minimising the function 


min 

s 


m+1 

[C{y{Si_i + l):s) + D] 

. ^=1 


(7) 


where C{-) is a cost function (such as the negative log likelihood) of each seg¬ 
ment i containing observations y{Si_i+i ):5 , and H is a penalty for adding addi¬ 
tional change points. PELT extends upon the optimal partitioning method 
of Yao (1984) by eliminating any change points which cannot be optimal. 
This is achieved by observing that if there exists a candidate change point 
s where S < s < S which reduces the overall cost of the sequence then the 
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change point at 6 can never be optimal and so is removed from consideration 
(Killick et ah, 2012). 

BART-BMA implements PELT as an efficient algorithm to hnd good split 
rules for each tree in the model. This function detects changes in the mean 
or variance of the tree response variable Rj with respect to each variable Xp 
in turn, where each change point is treated as a potential splitting rule to 
grow a tree. The subset of the best numcp% split points and split variables 
are chosen on the basis of its residual squared error. Our experience of 
testing BART-BMA on a number of different datasets is that a value of 
H = 10 log(n) performs well as a general default for the PELT penalty when 
the number of observations n < 200. For larger values of n we recommend 
using a higher value for D or the grid search option instead (see Section 3.1.2). 
BART-BMA implements a version of PELT in which is equivalent to the 
PELT.meanvar.norm function from the changepoint package in R (Killick 
et ah, 2014). This function searches for changes in the mean and variance of 
variables which are assumed to be normally distributed. Additional change 
points are accepted if there is support for their inclusion according to the log 
likelihood ratio statistic. 

3.1.2 Grid Search 

An alternative search algorithm offered by BART-BMA is the grid search. 
Here each variable Xp in dataset X is split into grid.size -|- 1 equally spaced 
partitions within the range of Xp. Each partition value is then used as a 
potential split point and the best numcp% split rules in terms of the residual 
squared error are chosen. Increasing the grid.size parameter to a large 
number is not recommended where both n and p are large, as BART-BMA 
may become prohibitively slow. The value grid_size = 15 led to good 
performance in most of the cases we implemented. 

3.1.3 Updating Split Rules 

By default BART-BMA chooses the best numcp% of the total split rules 
before the tree is grown and only trees using the most predictive split rules 
are considered for inclusion. However the best split rules can also be updated 
for each internal node in the tree, similarly to how RE creates trees. We 
have found that updating split rules at each internal node generally results 
in smaller sum of trees models, but each tree within the sum tends to be 
larger and to choose splits that are similar to the primary splits of trees in 
the RE. We have found that updating the split rules at each internal node can 
in some cases increase the predictive accuracy, but generally at the expense 
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of computational speed. 


3.2 Terminal Node Prior 

BART-BMA uses a different prior on the fiij parameters to that used in 
BART as described in Section 2.2. We use the standard conjugate prior sug- 
gested by Chipman et al. (1998) and assume (T~iV(0, ^). The reason 
BART-BMA chooses a different prior on Hij to BART is for computational 
simplicity as both the terminal node mean and the model precision r = 
can be marginalised out of the calculation of the tree likelihood, as shown 
in (8). This means that the model precision does not need to be updated 
through MCMC as in BART (see (3)). 


3.3 BART-BMA Sum of Tree Likelihood 


Unlike BART which calculates the likelihood for each individual Bayesian 
CART tree in the model, BART-BMA specihes the likelihood for each sum 
of trees in its model as follows: 


p(U|X,T)oc 


z/A - {Y^W){W^W + aI)-\W^Y) + Y'^Y 


n-\-u)-\-u 

2 


( 8 ) 


Here we dehne Jj for each Bayesian CART tree j = 1... m in the sum of trees 
model as an n X bj binary matrix where the elements of Jj denote inclusion 
of observation k = 1.. .n in terminal node i = 1... 6 of tree j. We let W = 
[ Ji... Jm\ be an n X ca matrix, where u = ^ 

be a vector of size u of terminal node means assigned to trees Ti... T^. We 
then set Y ~ N(WO,t~^) and O ~ iV(0, ^/). The parameters z/, A and r 
are specihed as described in Section 2.2. As in Chipman et al. (2010) we shift 
and scale the response variable to have mean 0 and standard deviation 1 so 
Y^Y = n. When m = 1, (8) is equivalent to the full conditional distribution 
for an individual Bayesian CART model, as described by Chipman et al. 
(1998). 


3.4 BART-BMA Method 

BART-BMA grows Bayesian CART trees in the same manner as described 
in Chipman et al. (1998). The algorithm begins with a tree stump and only 
grows trees with high posterior probability. 

For each tree in BART-BMA, split points are chosen for each variable in 
X using one of the greedy search algorithms described in Section 3.1. BART- 
BMA starts by setting the current_tree_list to a stump tree which has all 
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the data in one terminal node. It then iteratively grows the tree model Thu 
for each terminal node i in tree ^ using each split rule h. 

For each tree model in BART-BMA, the posterior probability of the sum 
of trees containing which we refer to as SThu, is approximated using 
the BIC, dehned as 

BIC = -2(log(p(F|X, STue)) + \og{p{STMe))) + B log(n). (9) 

(Schwarz, 1978). In (9), p(Y\X, SThu) is the tree likelihood for the sum of 
trees model containing Thu as described in Section 3.3, B is the number of 
parameters in the model, and \og{p{STfiie)) refers to the log tree prior for 
the sum of trees model. Here we use the same regularisation prior on the 
tree size and shape as Chipman et al. (2010) (see Section 2.2). Like BART, 
BART-BMA seeks to keep individual tree components small so that no one 
tree can dominate the overall model. Thus high prior probability is placed on 
trees with < 5 terminal nodes, as in Chipman et al. (2010). To calculate the 
prior probability of a given tree Thu, we set the prior p(Thie) = nM=i Psplit 
for each internal node u in the tree. The prior for is then YY^^iPiThu)- 

As the terminal node mean and variance parameters have been integrated 
out of (8), the only parameters remaining in the calculation of the tree like¬ 
lihood are the split variable and the threshold at which that variable is split. 
As a split variable and split point are chosen at each internal node in the 
tree model, B is equal to twice the number of internal nodes in the sum of 
trees model. 

Trees are grown to a maximum user-specihed depth and only splits that 
have at least 5 observations in the child nodes are retained, in order to avoid 
growing large trees with nearly empty nodes. Once a tree has been grown, the 
predicted values for observations falling in each terminal node R^ij are set to 
the mean of the complete conditional of p{phij \...) as shown in Section 3.4.4. 

3.4.1 Occam’s Window 

In order to focus the algorithm on highly probable tree models, Bayesian 
model averaging (BMA) is used. BMA takes account of model uncertainty 
by averaging over the posterior distribution of many models instead of fo¬ 
cusing on the most probable one. Models are averaged over and weighted 
according to their posterior probabilities. In this way BMA takes account of 
the variability due to model selection. 

As it is not possible to perform an exhaustive search of the model space 
especially when p is large, we use a greedy and efficient version of BMA called 
Occam’s Window (Madigan and Raftery, 1994). Here only the best subset 
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of models which fall within Occam’s Window are averaged over using (10). 

\og{BIC() — argmin^(log(il/0)) < log(o). (10) 

As BART-BMA searches the model space, it approximates the posterior 
probability of each tree using the BIG (9). The lowest (best) BIG of any 
model encountered so far is saved and any sum of trees whose BIG falls 
within a given threshold o of the best model is saved, while those outside of 
Occam’s Window are discarded. Hence only those models for which there is 
high support from the data are maintained and those whose predictions are 
considerably worse than the best model are eliminated from consideration. 
Our experience has led us to use o = 1, 000 as a general default value. 

For each sum of trees model initially accepted in Occam’s Window, the 
residuals are calculated and the process described in the previous paragraph 
is repeated using the partial residuals of each tree as the new tree response 
variable. This process of iteratively htting trees is repeated until either a 
user-specihed number of iterations is reached or no more trees are accepted 
in Occam’s window. It should be noted that as models with better (lower) 
BIG are added to Occam’s Window, this may eliminate some models which 
were previously within Occam’s Window. Thus Occam’s Window constantly 
updates the best list of sum of trees models as the algorithm proceeds. We 
have found that iteratively htting trees with hve trees in each sum of trees 
model generally performs well. 

Once BART-BMA has selected the set of sums of trees within Occam’s 
window, the predicted response values are then calculated as a weighted 
average of the predicted values from the selected sum of trees models. Each 
sum of trees model is weighted by its approximate posterior probability, 
Wi/ where is the model weight for model i, dehned as 

Wi = exp (—0.5 BIG^ — v ), (11) 

where v = max;(—0.5 BIG^). The BART-BMA algorithm is detailed in Al¬ 
gorithm 2. 


3.4.2 Variable Importance 

Like BART, BART-BMA provides a principled variable importance score, 
which is simply the estimated posterior expectation of the number of splitting 
rules involving the variable. For each sum of trees model i with posterior 
probability wg as calculated in (11), let Kpe be the number of splitting rules 
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Algorithm 2: BART-BMA for continuous response 


Input: n X p matrix X with continuous response variable Y 
Output: RMSE,Credible interval for Y, after burn in updates for a 
Initialise: TreeJiesponse = Y_scaled] numjtrees.currentjround = 1 
Initialise: current JreeJist = previous JreeJist = a tree stump 
Initialise: lowest-BIC 
for jR— 1 to m do 

for i ^ 1 to numJreeswurrentwound do 

1. Find Good Split Rules: 

Run greedy search to hud numcp best split 
rules for each tree in the previous .tree .list, 

save to matrix splitting .rule 

2. Grow Greedy Trees in Occam’s Window 

For R ^ 1 to max.tree.depth 

{ 

Grow tree T* using each split rule in 
splitting .rule for each terminal node in each 
tree in previous.treeJist 

If Sum of trees including T* is in Occam’s 
Window 

Append T* to cur rent.tree .list 

} 

3. Make sum of trees models and update 
residuals 

• Set previous.tree.list=current.tree.list 

• Make list of sum of trees and save those in 
Occam's Window 

• Update residuals for each sum of trees model 
and save to Tree.Response matrix 

• lowest.BIC = min(BlC of trees in 
sum.tree.list) 


Y =Sum of weighted predictions over all sum of trees 
models 

. 15 

Implement post-hoc Gibbs Sampler for each sum of trees 
accepted in Occam’s Window 

return: 

Credible intervals for Y ; Sum of trees in Occam’s Window; 
Posterior probability of each sum of trees model 


4. 

5. 






containing variable Xp in model The variable importance score used by 
BART-BMA is then 


Imp(a;p) = —. (12) 

^p=l ^£=1 

3.4.3 Posthoc Gibbs Sampler 

In order to provide credible intervals for the point predictions, Y, provided 
by BART-BMA, a Gibbs sampler is run in a similar fashion to BART. For 
each sum of trees model accepted by BART-BMA a separate chain in the 
MCMC algorithm is run. Each terminal node parameter in each tree 
Tj is then updated followed by an update of a. The details of the updates 
for the complete conditional of p(/ijj|Tj, Rj, and of p(cr^) are given in the 
following sections. The Gibbs sampler yields credible intervals for each set 
of sum of trees models accepted by BART-BMA along with the updates for 
r = for each set of trees accepted in the hnal BART-BMA model. The 
hnal simulated sample from the overall posterior distribution is obtained by 
selecting a number of iterations from the Gibbs sampler for each sum of 
trees model proportional to its posterior probability, and combining them. 
The post-hoc Gibbs sampler used by BART-BMA is far less computationally 
expensive than that of BART as it requires only an update for jUij and a from 
the complete conditional of each sum of trees model, which is merely a draw 
from a normal distribution and an inverse-Gamma distribution respectively 
(see Sections 3.4.4 and 3.4.5 respectively). 

3.4.4 Update of p(MjjTj, R^ji, a^) 

Let /Uij G Mj index the b terminal node parameters of tree Tj, and be 
the partial residuals for observations l belonging to terminal node i used as 
the response variable to grow tree Tj. BART-BMA assumes that the prior on 
terminal node parameters is Pij\Tj, a^N{0, —), as in Ghipman et ah (1998). 
The prior distribution of the partial residual is Rj\ N{pij,a‘^). 

The full conditional distribution of Mj is then 

p{Mj\Tj, Raj, a) oc p{Raj\Tj, Mj, a)p{Mj\Tj) 

rii 

oc Wp{Raj\Tj, Mj, (T)p{Mj\Tj), (13) 

i=i 

where l indexes the observations within terminal node i of tree Tj and 
refers to the number of observations which fall in terminal node i. 
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The draw from the full conditional of p{Mj \...) is then a draw from the 
normal distribution 


Mj\Tj,R,ij,ar^N 


Erii R. 


it] 


a" 


Tli T (X Til T ^ 


(14) 


The full conditional of Mj \... depends only on a in the variance parameter, 
making it slightly more efficient than the update of Mj using the BART prior 
which depends on a in both the mean and variance parameter. 


3.4.5 Update of 

BART-BMA performs the update for p{a) in the same way as Chipman et ah 
(2010). The full conditional distribution of is: 

n 

p{(T^\Rj, Tj, Mj) oc JJp {Rj\Tj, Mj, a^) p [a^) , (15) 

k=l 

where Rj ^ (EJLi dixk, Tj, Mj), 0-2 j and Gamma(((’, 77 ), where ^ and 
p are equal to | and respectively. 

BART-BMA makes the draw for in terms of the precision r = R where 
p{T\Rj,Tj, Mj) is calculated as: 

T\Rj, Tj, Mjr^ Gamma 2 ’ ^ 


1 2 


The next value of r is then drawn 


where P = Efc Tk - J2j di^k, R, Mj) 
from (16) and the value of a is calculated by getting the reciprocal square 
root of that value. 


3.5 BART-BMA for Classification 


BART-BMA can also be used for binary classihcation. We follow the same 

strategy as Section 2.3 by introducing the latent variables Zk 

with 


Yk 


1 if Zfc > 0 
0 otherwise, 


as before. 
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Our method requires estimates of so that the previously introduced 
BART-BMA algorithm for continuous responses can be run without modi- 
hcation. We simply £x the for all k at the start of the algorithm. In 
practice we have found that this approach works well if we set all 
Zk = $“^(0,001) ~ —3.1 if = 0 and = $“^(0,999) if yu = 1- 

Once the Z^ values are set, BART-BMA uses these as the new response 
for the tree, updates the residuals Rj and iteratively fits trees as before. In 
order to set predicted Rj values in the terminal nodes, BART-BMA uses the 
mean of the full conditional for Mj, as given by (14). As in the continuous 
case this prior will shrink the terminal node means yij towards zero thus 
ensuring that no one tree can dominate the model. 

4 Results 

In this section we compare BART-BMA to RF and BART for a number of 
simulated datasets and also to two real proteomic data sets for the diag¬ 
nosis of cardiovascular disease and aggressive versus non-aggressive prostate 
cancer. 

All results reported in this section were run on a HP Z420 Workstation 
with 32GB RAM. BART was run using the bartMachine R package (Kapel- 
ner and Bleich, 2014b) and RF was run using the randomForest R package 
(Liaw and Matthew, 2015). All methods and results were obtained using R 
version 3.2.0. Default values for all model parameters were used for each 
method. 

4.1 Friedman Data 

As in Chipman et al. (2010) we use simulated data based on Friedman (1991) 
to compare the results of BART-BMA, RF and BART. The original simulated 
dataset of Friedman (1991) had 5 uniform predictor variables xi.. .x^ where 

y = 1 Osin(Trxi 0 : 2 ) + 20 ( 0:3 — 0.5)^ -|- 10 x 4 + Sxs -|- e. (17) 

In order to see how the three methods compared over various dataset 
sizes, five datasets were constructed by appending random noise variables 
to the truly important variables shown in (17) above, such that the total 
number of variables in each data set was p = (100,1000, 5000,10000,15000) 
where Xi...Xp are uniform random variables and e~A^(0,l). Hence each 
method was compared across 5 datasets using five-fold cross-validation and 
the cross-validated root-mean squared error (RMSE) was recorded along with 
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the CPU time taken to run each method. For the Friedman datasets BART- 
BMA results are shown using the GRID search, as in all cases the number of 
observations is n = 500 (see Section 3.1.1 for default recommendations). 

Figure 1 shows the results for the model accuracy (Figure la) and CPU 
time (Figure lb) of the three methods for the simulated Friedman data. As 
can be seen, BART performed best with regards to RMSE for small datasets 
where p < 1,000 (Figure lA). However, once the number of variables was 
increased past p > 5, 000 BART-BMA outperformed both RF and BART. In 
fact BART-BMA outperformed RF and BART for all cases where p > 1, 000 
regardless of the method used for Ending good splits or whether the split 
rules were updated before the tree or during the tree growing process. 

With 5,000 variables, BART required a minimum of 10GB of RAM to 
run its default model, which is already beyond the capabilities of a current 
standard laptop. When p = 10, 000 BART required a minimum of 22GB of 
RAM, and for p = 15000 it required a minimum of 30GB of RAM which 
would necessitate a specialised computer. Unlike BART, BART-BMA does 
not have any special memory requirements and will run on a standard laptop 
with 4GB of RAM for all five cases shown here. 

With regard to model speed it can be seen from Figure IB that for all 
5 datasets BART-BMA was considerably faster than RF and was also more 
accurate for all but the case where p = 100. BART performed the fastest for 
all cases where p > 1000, but its model accuracy for these datasets was far 
worse than either BART-BMA or RF. 

In order to assess how well calibrated the prediction intervals from BART- 
BMA and BART are, we show the average coverage of the in-sample 95% 
prediction intervals and the average interval width for each of the simulated 
datasets in Table 1. RF does not provide confidence intervals and so could 
not be included. If the method is calibrated, on average we would expect 
95% of the intervals to contain the true value of y. From Table 1 it can be 
seen that BART-BMA is the best calibrated across the 5 simulated datasets 
shown, as its coverage is much closer to 95% than BART in all cases. 

The right hand side of Table 1 shows the average interval width for both 
BART-BMA and BART. Here it can be seen that for small datasets where 
p < 1, 000 BART had a shorter interval width, but this should be seen in the 
context of the fact that BART is also not well calibrated. For larger datasets 
where p > 1000 however, BART-BMA has a much shorter mean interval 
width than BART and had more accurate coverage. Overall BART-BMA 
was more accurate and better calibrated than BART, and needed much less 
computation than BART for data sets where p > 1000. 
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Friedman Data 


RMSE Time Seconds 




-■*- RF BART BART-BMA 


Figure 1: Friedman example: Comparison of RMSE and CPU run 
time in seconds for the 5 simulated Friedman datasets where p = 
100,1000, 5000,10000,15000. As n = 500 the GRID method was used to 
search for the subset of best splits. It should be noted although BART ran 
in the fastest time for p > 5000 it performed considerably worse than RF 
and BART-BMA for these datasets with respect to RMSE. 
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Table 1: Friedman example: coverage for in-sample 95% prediction intervals 
and average interval width for BART-BMA and BART (RF does not provide 
confidence intervals and so is not included). Perfect calibration is 95% hence 
the model with the lowest average interval width and a coverage as close to 
95% as possible is most desirable. Items in bold refer to the best calibrated 
model with respect to interval coverage and interval width for each simulated 
dataset. _ 


p 

Coverage 

Avg Int Width 

BART 

BMA 

BART 

BART 

BMA 

BART 

100 

97.0% 

99.7% 

11.73 

7.19 

1000 

97.4% 

99.9% 

11.69 

9.90 

5000 

96.5% 

100% 

11.67 

15.39 

10000 

96.4% 

100% 

11.66 

17.46 

15000 

96.4% 

100% 

11.68 

18.07 


4.1.1 Friedman Data: Variable Importance 

We now analyse the variable importance scores for each method. As these 
data are simulated, it is known that variables xi.. .x^ are truly important 
and all other variables are random noise. As BART and BART-BMA pro¬ 
vide variable inclusion probabilities, the decrease in Gini impurity provided 
by the RF was converted into a probability in order to allow for compari¬ 
son between the variable importance scores assigned to each variable. The 
average importance for variables xi.. .x^ over the 5 cross-validation folds is 
reported in Figure 2. 

BART-BMA had much larger average variable importance scores than RF 
and BART for variables Xi and X 2 for all values of p. It also had much larger 
variable importance scores for xs when p > 1,000. For x^, BART-BMA 
had substantial variable importance scores, as did RF. For xs, BART-BMA 
had larger variable importance scores than the other methods for p < 1,000, 
and all three methods had low variable importance scores for p > 1,000. 
BART had strikingly low variable importance scores for the truly important 
variables, regardless of the numbers of noise variables. 

Table 2 shows the sum of the variable importance scores assigned to the 
random variables Xq ... Xp. Here it can be seen that across all hve datasets 
BART-BMA correctly selected only those truly important variables in its 
model and never included any of the random noise variables. RF also con- 
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Friedman Variable Importance Scores 



RF BART BART-BMA 


Figure 2: Friedman example: variable importance scores for the truly impor¬ 
tant variables xi.. .x^ for each of the 5 Friedman datasets. As u = 500 the 
GRID method was used to search for the subset of best split rules. BART- 
BMA and BART scores show the mean variable inclusion probability, RF 
scores show the mean decrease in Gini index expressed as a probability. 
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Table 2; Friedman example: Sum of the Variable Importance Score assigned 
to random variables Xq ... Xp. The method which assigns the lowest impor¬ 
tance to random variables is considered the best. Items in bold show the 
best method for each of the hve simulated datasets._ 


p 

BART-BMA 

RF 

BART 

100 

0 

0.35 

0.63 

1000 

0 

0.45 

0.81 

5000 

0 

0.50 

0.92 

10000 

0 

0.52 

0.95 

15000 

0 

0.53 

0.97 


Table 3: Friedman example: Brier score = ^ '^p=i{Ip~VISpY where Ip = I 
for truly important variables Xi.. .x^ and Ip = 0 otherwise. Hence the lower 
the Brier score the better the model variable selection. Items in bold show 
the best model with respect to the Brier score. 


P 

BART-BMA 

RF 

BART 

100 

3.24 X 10-2 

3.82 X 10-2 

4.30 X 10-2 

1000 

3.26 X 10-^ 

4.00 X 10-3 

4.62 X 10-3 

5000 

6.55 X 10-^ 

8.18 X 10-^ 

9.68 X 10-^ 

10000 

3.28 X 10-^ 

4.13 X 10-^ 

4.89 X 10-^ 

15000 

2.18 X 10-^ 

2.76 X 10-^ 

3.29 X 10-^ 


siderably outperformed BART with regards to variable selection and assigned 
far lower importance to the random noise variables across all hve datasets. 

In order to further assess the overall quality of the variable importance 
scores assigned across all variables we used a Brier score = ~ 

VISpY where Ip = 1 for truly important variables X 1 ...X 5 and Ip = 0 
otherwise. Hence the model with the lowest Brier score is considered the 
best. Brier scores for all 5 simulated datasets are shown in Table 3. Here 
it can be seen that the variable importance scores from BART-BMA gave 
the best overall performance across all 5 datasets assessed and also that RF 
outperformed BART for each of the datasets with regards to the Brier score. 
Overall it can be seen that BART-BMA outperformed RF and BART with 
regards to the quality of its variable selection and did not include any random 
noise variables in its model regardless of the size of the dataset. 
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4.2 Prostate Cancer Data 


Prostate cancer is a heterogeneous disease. In some men it manifests itself 
as an acute, aggressive and rapidly advancing condition, and in other men as 
a slow disease that is responsive to existing treatment regimes for signihcant 
periods of time. It is widely recognized that existing methods to classify 
the grade of the disease (using serum PSA levels, digital rectal examination 
and Gleason score) are not well suited for monitoring its progression nor 
establishing the optimal timing of treatment interventions (Logothetis et ah, 
2013). It is therefore important to be able to distinguish between aggressive 
and nonaggressive forms of the disease in a timely manner. 

Here we show the results of an experiment where the expression levels 
of 51 peptides were measured using multiple reaction monitoring (MRM) on 
63 patients with prostate cancer. Of these patients, 21 had extra capsular 
extension which is a surrogate for aggressive disease, while the remaining 
42 had localised disease which had not spread beyond the boundaries of the 
prostate gland. 

In order to access model performance we use the precision recall curve 
(Davis and Goadrich, 2006). The precision of a model is another term for 
the positive predictive value and measures the 

ETrue pSuves+gSse Positives ' ^ere it is measuring the probability of a pa¬ 
tient having extra capsular extension given that the model predicted they 
had extra capsular extension. The recall of a model is another name for the 
specihcity and measures 

— E True N^^atives —^ recall is measuring the proba- 

bility of a patient being diagnosed as extra capsular extension by the model 
given they actually had extra capsular extension. The Precision Recall Gurve 
shows the precision of a model over varying thresholds of the recall and the 
area under this curve is known as the Area under the Precision Recall Gurve 


(AUPRG), the higher the AUPRG of a model the better the model. 

Table 4 shows the comparison of BART-BMA, RF and BART for this 
dataset in terms of the classihcation rate, the AUPRG and the GPU time 
in seconds. BART-BMA using the PELT search performed best in terms of 
classihcation accuracy, identifying 79% of the cases correctly. It also per¬ 
formed best in terms of AUPRG with an area of 0.68. Here the PELT rather 


than the GRID search was used as the sample size for this experiment was 
small (n < 200). 

Table 5 shows the hve most important variables chosen across the three 
models. As can be seen here BART-BMA, RF and BART agreed that vari¬ 
ables number 50 and 18 were the two most important for this dataset. RF 
and BART both chose variables 50, 18, 2 and 3 in their top hve. BART-BMA 
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Table 4: Prostate cancer data: Classification rate, area under the precision 
recall curve (AUPRC) and CPU run time in seconds (standard deviation in 
brackets) for BART-BMA, RF and BART. Methods with higher classification 
rate, AUPRC and lower CPU running time are more desirable. Elements in 
bold show the best method with respect to each of the criteria. 



BART-BMA 

RF 

BART 

Classification 

Rate 

0.79 

0.71 

0.71 

AUPRC 

0.68 

0.67 

0.65 

Time in 
Seconds 

1.02 (0.06) 

0.07 (0.01) 

1.28 (0.61) 


tends to assign higher inclusion probabilities to a smaller number of variables 
than RF or BART which both tend to assign lower inclusion probabilities 
across a larger number of variables. In this case BART-BMA assigned a 
high variable importance score to variable 50 showing that this variable was 
present in 22.5% of the total splits across sum of trees models. If variables 
were chosen at random we would expect each variable to have an inclusion 
probability of 0.019 for this dataset and so across all cases a higher than 
random importance was assigned to the most important variables. 

4.3 Cardiovascular Disease 

This section describes an experiment to distinguish patients with cardiovas¬ 
cular disease from control. This experiment was undertaken on 498 patients, 
150 of whom had a cardiovascular disease and 348 of whom were healthy. A 
total of 36 proteins were measured by a targeted approach (MRM) in each 
patient sample. Table 6 shows the results for this experiment with respect to 
the classification rate, the AUPRC and the CPU time in seconds to run each 
method. BART-BMA was slightly more accurate than either RF or BART 
and correctly predicted 70% of the patients. With respect to the AUPRC 
however, BART-BMA did not perform as well as the other methods (see: 
Table 6) with BART performing the best by this measure. BART-BMA was 
quite fast in this instance and ran in 0.43 of a second, considerably faster 
than BART (4.82 seconds). 

Table 7 shows the five most important variables chosen by each of the 
methods. As can be seen, all three methods chose quite similar proteins in 
this case with BART-BMA and RF agreeing on 4 of their top 5 and BART 
agreeing on three of its top 5. Again it can be seen that BART-BMA tends to 
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Table 5: Prostate cancer data: Top five most important variables for each 
method. BART-BMA and BART scores show the mean variable inclusion 
probability, RF scores show the mean decrease in Gini index expressed as a 
probability. As n < 200 the PELT method was used to search for the subset 
of best split rules. 


Variable 

BART-BMA 

RF 

BART 

50 

0.225 

0.082 

0.025 

18 

0.168 

0.052 

0.024 

2 


0.035 

0.021 

3 


0.042 

0.022 

4 



0.021 

30 

0.103 



31 

0.071 



25 

0.050 



44 


0.037 



Table 6: Cardiovascular disease data: Classification rate, area under the 
precision recall curve (AUPRC) and CPU run time in seconds (standard 
deviation in brackets) for BART-BMA, RF and BART. Methods with higher 
classification rate, AUPRC and lower CPU running time are more desirable. 
Elements in bold show the best method with respect to each of the criteria. 



BART-BMA 

RF 

BART 

Classification 

Rate 

0.70 

0.69 

0.69 

AUPRC 

0.43 

0.46 

0.49 

Time in 
Seconds 

0.43 (0.22) 

0.59 (0.04) 

4.82 (0.72) 
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Table 7: Cardiovascular disease data: Top five most important variables for 
each method. BART-BMA and BART scores show the mean variable inclu¬ 
sion probability, RF scores show the mean decrease in Gini index expressed 
as a probability. As n > 200 the GRID method was used to search for the 
subset of best split rules. 


Variable 

BART-BMA 

RF 

BART 

14 

0.33 

0.05 

0.04 

4 

0.16 

0.03 


24 

0.14 

0.05 

0.03 

3 

0.06 

0.03 

0.03 

34 

0.06 



2 



0.03 

10 



0.03 

13 


0.03 



assign a higher inclusion probability to a smaller set of proteins where as RF 
and BART tend to spread the probability across a larger number of variables. 
If the model were assigning probabilities uniformly across variables we would 
expect a probability ~ 0.03 which is the probability assigned to 4 of the top 5 
variables included by BART and two of those included by RF. BART-BMA 
in all cases assigned a much higher than random inclusion probability to its 
most important variables. 


5 Conclusions 

We have proposed a Bayesian tree ensemble method called BART-BMA 
which modifies the BART method of Chipman et ah (2010) and can be 
used on datasets where the number of variables is very large. Instead of esti¬ 
mating the tree node parameters using MCMC, BART-BMA uses a version 
of Bayesian model averaging which is more memory efficient, and generally 
more accurate when the number of variables is large, as only a subset of the 
best models are averaged over. Ghanging the model priors to those used in 
Ghipman et al. (1998) means that the model precision is needed only for 
calculation of the prediction credible intervals and not in the calculation of 
the likelihood. As such, a fast posthoc Gibbs sampler can be run, yielding 
estimates of predictive uncertainty in addition to point predictions. 
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BART-BMA proposes an efficient strategy for finding good split rules 
which works particularly well for high-dimensional data where the uniform 
prior used by BART becomes prohibitively computationally intensive. BART- 
BMA borrows elements of both BART and RF in that it is a sum of trees 
ensemble model which averages over multiple sums of trees and as such offers 
a model-based alternative to machine learning methods for high-dimensional 
data where BART is not feasible. 

BART-BMA like RF not only provides a variable importance score but 
also provides access to the trees chosen in the final model. In general BART- 
BMA tends to choose shallower and more interpretable trees than RF as only 
splits which result in a high posterior probability are included. 

We have showcased BART-BMA using both simulated and real life pro- 
teomic datasets and have shown its usefulness for high-dimensional data 
where BART will require a specialised computer or not run at all. We have 
found that BART-BMA is competitive with its main competitors RF and 
BART in terms of speed and accuracy and will run on a current standard 
laptop (4-8GB RAM). 

We envisage future applications and extensions for BART-BMA including 
dealing with missing data as well as for use on longitudinal data and extend¬ 
ing this model to the family of generalised linear models. BART-BMA could 
also be parallelised to reap further gains in computational speed. 
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